AITopics | fp8 format

Collaborating Authors

fp8 format

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Hybrid 8-bit Floating Point (HFP8) Training and Inference for Deep Neural Networks

Xiao Sun, Jungwook Choi, Chia-Yu Chen, Naigang Wang, Swagath Venkataramani, Vijayalakshmi (Viji) Srinivasan, Xiaodong Cui, Wei Zhang, Kailash Gopalakrishnan

Neural Information Processing SystemsFeb-12-2026, 10:11:34 GMT

Neural Information Processing Systems http://nips.cc/

accuracy, degradation, precision, (14 more...)

Neural Information Processing Systems

Country:

North America > United States (0.04)
North America > Canada (0.04)
Asia > South Korea (0.04)
Asia > Middle East > Jordan (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

FP8 Quantization: The Power of the Exponent Andrey Kuzmin, Mart V an Baalen

Neural Information Processing SystemsFeb-9-2026, 07:46:13 GMT

Neural network quantization is one of the most effective ways to improve the efficiency of neural networks. Quantization allows weights and activations to be represented in low bit-width formats, e.g. 8 bit integers (INT8).

artificial intelligence, machine learning, quantization, (18 more...)

Neural Information Processing Systems

Country: Europe > Belgium > Brussels-Capital Region > Brussels (0.04)

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

FP8 Quantization: The Power of the Exponent

Neural Information Processing SystemsDec-24-2025, 07:27:17 GMT

When quantizing neural networks for efficient inference, low-bit integers are the go-to format for efficiency. However, low-bit floating point numbers have an extra degree of freedom, assigning some bits to work on an exponential scale instead. This paper in-depth investigates this benefit of the floating point format for neural network inference. We detail the choices that can be made for the FP8 format, including the important choice of the number of bits for the mantissa and exponent, and show analytically in which settings these choices give better performance. Then we show how these findings translate to real networks, provide an efficient implementation for FP8 simulation, and a new algorithm that enables the learning of both the scale parameters and number of exponent bits in the FP8 format. Our chief conclusion is that when doing post-training quantization for a wide range of networks, the FP8 format is better than INT8 in terms of accuracy, and the choice of the number of exponent bits is driven by the severity of outliers in the network. We also conduct experiments with quantization-aware training where the difference in formats disappears as the network is trained to reduce the effect of outliers.

fp8 format, fp8 quantization, name change, (7 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.50)

Add feedback

Hybrid 8-bit Floating Point (HFP8) Training and Inference for Deep Neural Networks

Xiao Sun, Jungwook Choi, Chia-Yu Chen, Naigang Wang, Swagath Venkataramani, Vijayalakshmi (Viji) Srinivasan, Xiaodong Cui, Wei Zhang, Kailash Gopalakrishnan

Neural Information Processing SystemsOct-2-2025, 21:26:46 GMT

Neural Information Processing Systems http://nips.cc/

artificial intelligence, deep learning, machine learning, (17 more...)

Neural Information Processing Systems

Country: North America (0.28)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

As deep learning methodologies have developed, it has been generally agreed that increasing neural network size improves model quality. However, this is at the expense of memory and compute requirements, which also need to be increased. Various efficiency techniques have been proposed to rein in hardware costs, one being the use of low precision numerics. Recent accelerators have introduced several different 8-bit data types to help accommodate DNNs in terms of numerics. In this paper, we identify a metric driven methodology to aid in the choice of numerics. We demonstrate how such a methodology can help scale training of a language representation model. The technique can be generalized to other model architectures.

matrix multiplication, neural network, quantization, (15 more...)

arXiv.org Artificial Intelligence

2408.02897

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Add feedback

Efficient Post-training Quantization with FP8 Formats

Shen, Haihao, Mellempudi, Naveen, He, Xin, Gao, Qun, Wang, Chang, Wang, Mengni

arXiv.org Artificial IntelligenceSep-25-2023

Recent advances in deep learning methods such as LLMs and Diffusion models have created a need for improved quantization methods that can meet the computational demands of these modern architectures while maintaining accuracy. Towards this goal, we study the advantages of FP8 data formats for post-training quantization across 75 unique network architectures covering a wide range of tasks, including machine translation, language modeling, text generation, image classification, generation, and segmentation. We examine three different FP8 representations (E5M2, E4M3, and E3M4) to study the effects of varying degrees of trade-off between dynamic range and precision on model accuracy. Based on our extensive study, we developed a quantization workflow that generalizes across different network architectures. Our empirical results show that FP8 formats outperform INT8 in multiple aspects, including workload coverage (92.64% vs. 65.87%), model accuracy and suitability for a broader range of operations. Furthermore, our findings suggest that E4M3 is better suited for NLP models, whereas E3M4 performs marginally better than E4M3 on computer vision tasks. The code is publicly available on Intel Neural Compressor: https://github.com/intel/neural-compressor.

accuracy, fid score, quantization, (13 more...)

arXiv.org Artificial Intelligence

2309.14592

Country:

North America > United States > New York (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

FP8 versus INT8 for efficient deep learning inference

van Baalen, Mart, Kuzmin, Andrey, Nair, Suparna S, Ren, Yuwei, Mahurin, Eric, Patel, Chirag, Subramanian, Sundar, Lee, Sanghyuk, Nagel, Markus, Soriaga, Joseph, Blankevoort, Tijmen

arXiv.org Artificial IntelligenceJun-15-2023

Recently, the idea of using FP8 as a number format for neural network training has been floating around the deep learning world. Given that most training is currently conducted with entire networks in FP32, or sometimes FP16 with mixed-precision, the step to having some parts of a network run in FP8 with 8-bit weights is an appealing potential speed-up for the generally costly and time-intensive training procedures in deep learning. A natural question arises regarding what this development means for efficient inference on edge devices. In the efficient inference device world, workloads are frequently executed in INT8. Sometimes going even as low as INT4 when efficiency calls for it. In this whitepaper, we compare the performance for both the FP8 and INT formats for efficient on-device inference. We theoretically show the difference between the INT and FP formats for neural networks and present a plethora of post-training quantization and quantization-aware-training results to show how this theory translates to practice. We also provide a hardware analysis showing that the FP formats are somewhere between 50-180% less efficient in terms of compute in dedicated hardware than the INT format. Based on our research and a read of the research field, we conclude that although the proposed FP8 format could be good for training, the results for inference do not warrant a dedicated implementation of FP8 in favor of INT8 for efficient inference. We show that our results are mostly consistent with previous findings but that important comparisons between the formats have thus far been lacking. Finally, we discuss what happens when FP8-trained networks are converted to INT8 and conclude with a brief discussion on the most efficient way for on-device deployment and an extensive suite of INT8 results for many models.

artificial intelligence, machine learning, quantization, (19 more...)

arXiv.org Artificial Intelligence

2303.17951

Country: Europe > Belgium > Brussels-Capital Region > Brussels (0.04)

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Intel, AMD and Nvidia propose new standard to make AI processing more efficient

#artificialintelligenceSep-14-2022, 18:17:36 GMT

In pursuit of faster and more efficient AI system development, Intel, AMD and Nvidia today published a draft specification for what they refer to as a common interchange format for AI. While voluntary, the proposed "8-bit floating point (FP8)" standard, they say, has the potential to accelerate AI development by optimizing hardware memory usage and work for both AI training (i.e., engineering AI systems) and inference (running the systems). When developing an AI system, data scientists are faced with key engineering choices beyond simply collecting data to train the system. One is selecting a format to represent the weights of the system -- weights being the factors learned from the training data that influence the system's predictions. Weights are what enable a system like GPT-3 to generate whole paragraphs from a sentence-long prompt, for example, or DALL-E 2 to create photorealistic portraits from a caption.

amd and nvidia, intel, new standard, (10 more...)

#artificialintelligence

Industry: Information Technology > Hardware (0.70)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.57)

Add feedback